Robust Coding Schemes for Indexing and Retrieval from Large Face Databases

Chengjun Liu and Harry Wechsler

Abstract This paper introduces two new coding schemes, the Probabilistic Reasoning Models (PRM) and the Enhanced FLD 
(Fisher Linear Discriminant) Models (EFM), for indexing and retrieval from large image databases with applications to face
recognition. The unifying theme of the new schemes is that of lowering the space dimension ("data compression") subject to 
increased fitness for the discrimination index. 
1 Introduction

The contents of digital libraries consist of increasing amounts of digital media and, in particular, pictorial information 
characteristic of still images and video data. Computational challenges associated with the large volumes of data involved require
low-dimensional image features for indexing and retrieval to support efficient, robust, and scalable image matching. The range 
of applications is wide open and it covers amongst others security (biometrics and forensics), journalism, telecommunication, 
electronic commerce, and human computer interaction. This paper is concerned in particular with indexing and retrieval from 
large facial image databases with applications to face recognition. 

Learning to recognize visual objects, such as human faces, requires the ability to find meaningful patterns in spaces of very 
high dimensionality. Psychophysical findings indicate, however, that "perceptual tasks such as similarity judgment tend to be
performed on a low-dimensional representation of the sensory data. Low dimensionality is especially important for learning, 
as the number of examples required for attaining a given level of performance grows exponentially with the dimensionality of 
the underlying representation space" [4]. Low-dimensional representations are also important when one considers the intrinsic 
computational aspect. Principal Component Analysis (PCA) [6] is the method behind the eigenfaces coding scheme [17],
whose primary goal is to project the similarity judgment for face recognition onto a low-dimensional space. One should be aware,
however, that PCA-driven coding schemes are optimal and useful only with respect to data compression and decorrelation of low
(2nd) order statistics. The recognition aspect is not considered, and one should thus not expect optimal performance for tasks
such as face recognition when using such PCA-like coding schemes. To address this shortcoming, one reformulates the
original problem as one where the search is still for low-dimensional patterns but is now also subject to a high discrimination
index, characteristic of separable low-dimensional patterns. One solution proposed to solve the new problem is to use Fisher 
Linear Discriminant (FLD) [16] for the very purpose of achieving high separability between the different patterns in whose 
classification one is interested. Characteristic of this approach are recent schemes such as the Most Discriminating Features 
(MDF) [16] and the Fisherfaces [1].

As it is apparent that robust indexing and retrieval schemes require both low-dimensional representations and enhanced 
discrimination abilities, we address those complementary requirements using an approach similar to constrained optimization 
(regularization), where the task is that of lowering the space dimension subject to increased performance for the discrimination 
index. Towards that end we introduce two new coding schemes: the Probabilistic Reasoning Models (PRM) and the Enhanced 
FLD Models (EFM). Lowering the space dimension for these coding schemes is done using PCA. The coding schemes assess 
and modulate the discrimination index using criteria such as the Bayes classifier and specific tradeoffs between PCA and FLD. 

2 Probabilistic Reasoning Models (PRM) 

We present in this section two Probabilistic Reasoning Models (PRM) which combine PCA and the Bayes classifier, and 
then show their feasibility on the face indexing and retrieval problem. PCA is applied first for dimensionality reduction with the 
goal of signal approximation. The PRM use the within-class scatter to estimate the covariance matrix for each class in order to 
approximate the conditional pdf, and then apply the MAP rule as the classification criterion. The MAP decision rule optimizes 
the class separability in the sense of the Bayes error and should improve on PCA and FLD based encoding schemes, which 
utilize criteria not related to the Bayes error (Fukunaga [6]). 

2.1 Principal Component Analysis (PCA) 

One popular technique for feature selection and dimensionality reduction is PCA [6]. Let $X \in \mathbb{R}^N$ be a random vector
representing an image, where $N$ is the dimensionality of the image space. The covariance matrix of $X$ is defined as

$$\Sigma_X = E\{[X - E(X)][X - E(X)]^t\} \qquad (1)$$

where $E(\cdot)$ is the expectation operator, $t$ denotes the transpose operation, and $\Sigma_X \in \mathbb{R}^{N \times N}$. The PCA of a random vector $X$
factorizes the covariance matrix $\Sigma_X$ into the following form

$$\Sigma_X = \Phi \Lambda \Phi^t \quad \text{with} \quad \Phi = [\phi_1, \phi_2, \ldots, \phi_N], \quad \Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_N\} \qquad (2)$$

where $\Phi \in \mathbb{R}^{N \times N}$ is an orthonormal eigenvector matrix and $\Lambda \in \mathbb{R}^{N \times N}$ a diagonal eigenvalue matrix with diagonal elements
in decreasing order.

An important property of PCA is its optimal signal reconstruction in the sense of minimum Mean Square Error (MSE)
when only a subset of principal components, $P = [\phi_1, \phi_2, \ldots, \phi_m]$ where $m < N$, is used to represent the original signal.
Following this property, an immediate application of PCA is dimensionality reduction

$$Y = P^t X \qquad (3)$$

PCA was first applied to reconstruct human faces by Kirby and Sirovich [8]. Turk and Pentland [17] further developed a well
known face recognition method, known as eigenfaces, where the eigenfaces correspond to the eigenvectors associated with the 
dominant eigenvalues of the face covariance matrix. The eigenfaces define a feature space, or "face space", which drastically 
reduces the dimensionality of the original space, and face detection and identification are carried out in the reduced space. 
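
The projection of Eqs. 1 to 3 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names `pca_fit` and `pca_project` are hypothetical, the expectation in Eq. 1 is replaced by a sample average over the rows of `X`, and the sample mean is subtracted before projecting.

```python
import numpy as np

def pca_fit(X, m):
    """Return (mean, P, eigvals) for the m leading principal components.

    X has shape (n_samples, N), one image vector per row (assumed layout).
    The columns of P are the eigenvectors with the m largest eigenvalues.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Sample estimate of the covariance matrix of Eq. 1.
    cov = Xc.T @ Xc / X.shape[0]
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:m]
    return mean, eigvecs[:, order], eigvals[order]

def pca_project(X, mean, P):
    """Eq. 3 applied row-wise to centered data: Y = P^t (X - mean)."""
    return (X - mean) @ P
```

For image-sized N, forming the N x N covariance matrix explicitly is costly; a singular value decomposition of the centered data matrix spans the same subspace and is usually preferred in practice.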

2.2 MAP Bayes Classification Rule 

The Bayes classifier yields the minimum error when the underlying probability density functions (pdf) are known. This 
error, called the Bayes error, is the optimal measure for feature effectiveness when classification is of concern, since it is a 
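measure of class separability [6]. The a posteriori probability function of $\omega_i$ given $X$ is defined as

$$P(\omega_i | X) = \frac{p(X | \omega_i) P(\omega_i)}{p(X)} \qquad (4)$$

where $p(X | \omega_i)$ is the conditional pdf of $\omega_i$, $P(\omega_i)$ the a priori probability, and $p(X)$ the mixture density. The MAP decision rule for the Bayes classifier is

$$p(X | \omega_i) P(\omega_i) = \max_j \{p(X | \omega_j) P(\omega_j)\}, \quad X \in \omega_i \qquad (5)$$

The image $X$ is assigned to the class $\omega_i$ whose a posteriori probability given $X$ is the largest among all classes.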

The Bayes classifier involves the estimation of the conditional pdf for each class and requires a large number of training 
samples in order to give accurate results. To derive an accurate density estimation nonparametrically is extremely difficult, 
especially in high-dimensional spaces [6]. As there are usually not enough samples, one would then assume a particular form 
for the pdf and convert the general density estimation into a parametric one. The within-class densities are usually modeled as 
normal distributions [10]. 

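$$p(X | \omega_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (X - M_i)^t \Sigma_i^{-1} (X - M_i) \right\} \qquad (6)$$

where $M_i$ and $\Sigma_i$ are the mean and covariance matrix of class $\omega_i$, respectively.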
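
As a hedged illustration of how Eqs. 5 and 6 combine, the following sketch evaluates the normal density in log form and picks the class with the largest value of $p(X | \omega_i) P(\omega_i)$. The names `log_gaussian` and `map_classify` are hypothetical, and the class means, covariances, and priors are assumed to be given.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Logarithm of the multivariate normal density of Eq. 6."""
    d = x - mean
    n = x.shape[0]
    # slogdet avoids overflow or underflow in the determinant.
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis term d^t cov^{-1} d, computed without an explicit inverse.
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)

def map_classify(x, means, covs, priors):
    """Eq. 5 in log form: index of the class maximizing p(x|w_i) P(w_i)."""
    scores = [log_gaussian(x, m, c) + np.log(p)
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```

Working with logarithms leaves the maximizer of Eq. 5 unchanged, because the logarithm is monotonic, and it keeps the computation stable when the densities are very small.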

Under the parametric pdf (multivariate normal densities) assumption, Moghaddam and Pentland [11] used the Bayesian 
classifier to develop the Probabilistic Visual Learning (PVL) method. In contrast to previous PCA-inspired methods, which
use PCA for dimensionality reduction, PVL works by using the eigenspace decomposition as an integral part of estimating the 
conditional pdf in the original high-dimensional image space. 

2.3 A Unified Statistical Framework 

The PRM to be detailed later in this section integrate PCA and the Bayes classifier. The first step is to use PCA for reducing 
the dimensionality of the original image space from $N$ to $m$ (Eq. 3): $X \in \mathbb{R}^N \rightarrow Y \in \mathbb{R}^m$, where $m < N$. The rationale
behind applying PCA first for dimensionality reduction, instead of applying the Bayes classifier and the MAP rule directly
to the original data, is two-fold. On the one hand, the high dimensionality of the original image space makes parameter
estimation very difficult, if not impossible, because the high-dimensional space is mostly empty. This problem
of sparsity limits the success of direct Bayesian analysis in the original space, since the amount of training data needed to
get reasonably low variance estimators becomes "ridiculously high" [7]. On the other hand, it has been confirmed by many
researchers that the PCA representation enjoys image constancy in the sense that it suppresses input noise [8], [12]. The second
step of PRM is then to estimate the within-class density; under the normal probability distribution assumption, this step is
equivalent to estimating the within-class covariance matrices (Eq. 6).

Estimating the covariance matrix $\Sigma_i$ in Eq. 6 with respect to each class is still difficult due to the limited number of samples
for each class. Note that while the mixture covariance matrix is diagonal following PCA, the within-class covariance matrices
are not necessarily diagonal. Using different assumptions on the derivation of the within-class covariance matrices but within a
unified statistical framework leads to alternative indexing and retrieval methods as follows:

a) Assume the within-class covariance matrices to be unit matrices: $\Sigma_i = \Sigma_I = I_m$. Under this assumption the conditional pdf
(Eq. 6) reduces to $p(Y | \omega_i) = (2\pi)^{-m/2} \exp\{-\frac{1}{2}(Y - M_i)^t (Y - M_i)\}$, where $M_i = E(Y | \omega_i)$. As a result, the MAP rule (Eq. 5)
leads to a distance classifier which corresponds to the eigenfaces method used by Turk and Pentland [17].

b) Following dimensionality reduction using PCA (Eq. 3), one would use FLD to derive the new feature set $W = P_{FLD}^t Y$.
In the FLD space, assume all the within-class covariance matrices to be unit matrices: $\Sigma_i = \Sigma_I = I_m$. Under this assumption
the conditional pdf (Eq. 6) reduces to $p(W | \omega_i) = (2\pi)^{-m/2} \exp\{-\frac{1}{2}(W - M_i)^t (W - M_i)\}$, where $M_i = E(W | \omega_i)$. Again
the MAP rule (Eq. 5) leads to a distance classifier which corresponds to the Fisherfaces method as described by Belhumeur,
Hespanha, and Kriegman [1].

c) In the PCA space (Eq. 3), assume all the within-class covariance matrices are identical and diagonal: $\Sigma_i = \Sigma_I =
\mathrm{diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2\}$. Each diagonal element is estimated by the sample variance in the one dimensional PCA space. This
case corresponds to our new method labeled PRM-1 (see Sect. 2.4).

d) Compute the within-class covariance matrix based on the within-class scatter $\Sigma_w$ in the reduced PCA space. Diagonalize
$\Sigma_w$, and use the ordered diagonal elements as estimates of the within-class covariance matrices corresponding to those derived
in c). This case corresponds to our new method labeled PRM-2 (see Sect. 2.4).

2.4 PRM-1 and PRM-2 Models 

The two probabilistic reasoning models, PRM-1 and PRM-2, utilize the within-class scatters to derive the estimates of the
within-class covariance matrices. The PRM-1 model assumes the within-class covariance matrices are identical and diagonal
under the assumption of c) in Sect. 2.3

$$\Sigma_i = \Sigma_I = \mathrm{diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2\} \qquad (7)$$

Each component $\sigma_i^2$ is estimated by the sample variance in the one dimensional PCA space

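$$\sigma_i^2 = \frac{1}{L} \sum_{k=1}^{L} \left\{ \frac{1}{N_k - 1} \sum_{j=1}^{N_k} \left( y_{ji}^{(k)} - m_{ki} \right)^2 \right\} \qquad (8)$$

where $y_{ji}^{(k)}$ is the $i$-th element of the image $Y_j^{(k)}$ which belongs to class $\omega_k$, $m_{ki}$ the $i$-th element of $M_k$, $M_k = E(Y | \omega_k)$, $N_k$ the number of training images in class $\omega_k$, and $L$ the number of classes.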
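From Eqs. 7 and 6, it follows

$$p(Y | \omega_i) = \frac{1}{(2\pi)^{m/2} \prod_{j=1}^{m} \sigma_j} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{m} \frac{(y_j - m_{ij})^2}{\sigma_j^2} \right\} \qquad (9)$$

where $Y = (y_1, y_2, \ldots, y_m)^t$. Thus the MAP rule (Eq. 5) specifies the following classifier (note that the priors are set to be
equal)

$$\sum_{j=1}^{m} \frac{(y_j - m_{ij})^2}{\sigma_j^2} = \min_k \left\{ \sum_{j=1}^{m} \frac{(y_j - m_{kj})^2}{\sigma_j^2} \right\}, \quad Y \in \omega_i \qquad (10)$$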
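
The PRM-1 steps just described can be sketched as follows. This is a minimal illustration, not the code behind the reported results: `prm1_fit` and `prm1_classify` are hypothetical names, `Y` is assumed to hold PCA features with one sample per row, and the per-class variance is assumed to use the unbiased `ddof=1` normalization, so every class needs at least two training samples.

```python
import numpy as np

def prm1_fit(Y, labels):
    """Class means and the shared diagonal variances of Eqs. 7 and 8."""
    classes = np.unique(labels)
    means = np.stack([Y[labels == c].mean(axis=0) for c in classes])
    # Sample variance of each PCA coordinate within each class
    # (ddof=1 is an assumption of this sketch) ...
    per_class = np.stack([Y[labels == c].var(axis=0, ddof=1) for c in classes])
    # ... averaged over the classes to give one variance per coordinate.
    sigma2 = per_class.mean(axis=0)
    return classes, means, sigma2

def prm1_classify(y, classes, means, sigma2):
    """Equal-prior MAP rule: smallest variance-weighted squared distance."""
    d = ((y - means) ** 2 / sigma2).sum(axis=1)
    return classes[np.argmin(d)]
```

With equal priors and a covariance shared by all classes, the normalizing factor of the density is the same for every class, so maximizing the posterior reduces to minimizing this weighted distance.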

The PRM-2 model estimates the within-class scatter matrix $\Sigma_w$ in the reduced PCA space as



To avoid the explicit calculation of $\Sigma_w$ and to improve numerical accuracy, we calculate the singular value decomposition
(SVD) of the matrix $Z$, where $\Sigma_w = Z Z^t$

$$Z = U S V^t \qquad (12)$$

where $U$ and $V$ are orthonormal matrices, and $S$ is a diagonal one

$$S = \mathrm{diag}\{s_1, s_2, \ldots, s_m\} \qquad (13)$$

with non-negative singular values as diagonal elements. Order the squared diagonal elements as

$$(s_{(1)}^2, s_{(2)}^2, \ldots, s_{(m)}^2) = \mathrm{order}\{s_1^2, s_2^2, \ldots, s_m^2\} \qquad (14)$$

Finally, under the assumption of d) in Sect. 2.3, the within-class covariance matrix is derived as

$$\Sigma_I = \mathrm{diag}\{s_{(1)}^2, s_{(2)}^2, \ldots, s_{(m)}^2\} \qquad (15)$$

From Eqs. 15, 6 and 5 the MAP rule specifies another classifier (note that again the priors are set to be equal). 
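
A hedged sketch of the PRM-2 variance estimate is given below. Several points are assumptions of the sketch rather than statements of the method: the names `prm2_variances` and `prm2_classify` are hypothetical, the within-class scatter is pooled over all samples with a 1/n normalization, the number of training samples is taken to be at least m, and the resulting classifier is read as the same weighted distance as in PRM-1 with the ordered squared singular values used as variances.

```python
import numpy as np

def prm2_variances(Y, labels):
    """Ordered squared singular values of Z, where Z Z^t is the within-class scatter."""
    classes, idx = np.unique(labels, return_inverse=True)
    means = np.stack([Y[idx == k].mean(axis=0) for k in range(len(classes))])
    # Center every sample on the mean of its own class.
    D = Y - means[idx]
    # Z is m x n and Z Z^t equals the pooled within-class scatter
    # (the 1/n normalization is an assumption of this sketch).
    Z = D.T / np.sqrt(Y.shape[0])
    # Only the singular values are needed; NumPy returns them in descending order.
    s = np.linalg.svd(Z, compute_uv=False)
    return classes, means, s ** 2

def prm2_classify(y, classes, means, s2):
    """Equal-prior MAP rule with the variances replaced by the ordered s^2 values."""
    d = ((y - means) ** 2 / s2).sum(axis=1)
    return classes[np.argmin(d)]
```

Taking the SVD of $Z$ rather than the eigendecomposition of $Z Z^t$ avoids squaring the condition number, which is the numerical-accuracy argument made in the text.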

2.5 Experimental Results 

After estimating the within-class covariance matrices using Eqs. 7, 8 (for PRM-1) and Eqs. 12, 15 (for PRM-2), indexing
and retrieval as appropriate for face recognition tasks can be carried out using the Bayes decision rule (Eq. 5). The a priori
probabilities are assigned values according to prior knowledge, and are set to be equal in our experiments (Eqs. 10 and 16).

The experimental data consists of 1,107 facial images corresponding to 369 subjects and comes from the US Army FERET 
database [14]. 600 out of the 1,107 images correspond to 200 subjects with each subject having three images — two of them 
are the first and the second shot, and the third shot is taken under low illumination. For the remaining 169 subjects there are 
also three images for each subject, but two out of the three images are duplicates taken at a different time. Two images of each 
subject are used for training with the remaining image for testing. The images are cropped to the size of 64 x 96, and the eye
coordinates are manually detected. 

Fig. 1 shows the comparative performance of eigenfaces, Fisherfaces, PRM-1 and PRM-2, and one can see that the PRM
consistently outperform the eigenfaces and Fisherfaces methods. In particular, the PRM class (PRM-1 and PRM-2) increases the
peak retrieval rate by 5% when compared to the eigenfaces and Fisherfaces methods; the peak retrieval rate for both PRM is about
96% using 44 features.

3 Enhanced FLD Models (EFM) 

We present in this section two Enhanced FLD Models (EFM) in order to improve the generalization capability of FLD based 
classifiers. Similar to Fisherfaces, both EFM models (EFM-1 and EFM-2) apply first PCA for dimensionality reduction before 
proceeding with FLD analysis. EFM-1 implements the dimensionality reduction with the goal to balance between the need that 
the selected eigenvalues account for most of the spectral energy of the raw data and the requirement that the eigenvalues of the 
within-class scatter matrix in the reduced PCA space are not too small. EFM-2 implements the dimensionality reduction as the 
Fisherfaces method does. It proceeds then with the whitening of the within-class scatter matrix in the reduced PCA space and 
then chooses a small set of features (corresponding to the eigenvectors of the within-class scatter matrix) so that the smaller 
trailing eigenvalues are not included further in the computation of the between-class scatter matrix. 

3.1 Fisher Linear Discriminant (FLD) 

Fisher Linear Discriminant (FLD) is a widely used technique for feature selection. Let $\omega_1, \omega_2, \ldots, \omega_L$ and $N_1, N_2, \ldots, N_L$
denote the classes and the number of images within each class, respectively, where $L$ is the number of classes. Let $M_1, M_2, \ldots, M_L$
and $M$ be the means of the classes and the grand mean. Let $\Sigma_w$ and $\Sigma_b$ be the within- and between-class scatter matrices. The
standard FLD procedure derives a projection matrix $\Psi$ that maximizes the ratio $|\Psi^t \Sigma_b \Psi| / |\Psi^t \Sigma_w \Psi|$. This ratio is maximized
when $\Psi$ consists of the eigenvectors of the matrix $\Sigma_w^{-1} \Sigma_b$ [6]

$$\Sigma_w^{-1} \Sigma_b \Psi = \Psi \Delta \qquad (17)$$

where $\Psi, \Delta \in \mathbb{R}^{N \times N}$ are the eigenvector and eigenvalue matrices of $\Sigma_w^{-1} \Sigma_b$.
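
The eigenproblem of Eq. 17 can be illustrated with a short NumPy sketch. The names `scatter_matrices` and `fld` are hypothetical, and the normalization of the scatter matrices (division by the total number of samples) is an assumption; the within-class scatter is also assumed to be nonsingular, which is why the methods discussed here reduce the dimensionality with PCA first.

```python
import numpy as np

def scatter_matrices(Y, labels):
    """Within- and between-class scatter matrices (1/n normalization assumed)."""
    classes, idx = np.unique(labels, return_inverse=True)
    grand = Y.mean(axis=0)
    dim = Y.shape[1]
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for k in range(len(classes)):
        Yk = Y[idx == k]
        Mk = Yk.mean(axis=0)
        Dk = Yk - Mk
        Sw += Dk.T @ Dk
        diff = (Mk - grand)[:, None]
        Sb += Yk.shape[0] * (diff @ diff.T)
    n = Y.shape[0]
    return Sw / n, Sb / n

def fld(Sw, Sb, n_axes):
    """Leading eigenvectors and eigenvalues of Sw^{-1} Sb (Eq. 17)."""
    # Sw^{-1} Sb is not symmetric in general, so the general solver is used.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:n_axes]
    return evecs[:, order].real, evals[order].real
```

Because $\Sigma_w^{-1} \Sigma_b$ is not symmetric, its eigenvectors are in general not orthogonal, which is the source of the non-orthogonal projection axes mentioned in the text.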

FLD overcomes one of PCA's drawbacks since it can distinguish within- and between-class scatters. Furthermore, FLD 
induces non-orthogonal projection axes, a characteristic known to have great functional significance in biological sensory 
systems [3]. The drawback of FLD is that it requires a large sample size for good generalization. As this requirement is rarely
met, FLD overfits and thus generalizes poorly when compared against PCA schemes [9]. One possible remedy for this drawback
is to artificially generate additional data and thus increase the sample size [5].

FLD is behind several face recognition methods [16], [1], [5]. As the original image space is high dimensional, most methods
first perform dimensionality reduction using PCA, as is the case with the Fisherfaces method developed by Belhumeur,
Hespanha, and Kriegman [1]. Using similar arguments, Swets and Weng [16] point out that the eigenfaces derive only the Most
Expressive Features (MEF). As explained earlier, such PCA-inspired features do not necessarily provide for good discrimination.
As a consequence, subsequent FLD projections are used to build the Most Discriminating Features (MDF) classification
space. The MDF space is, however, superior to the MEF space for face recognition only when the training images are representative
of the range of face (class) variations; otherwise, the performance difference between the MEF and MDF is not significant
[16].

3.2 The Standard FLD Based Methods and Overfitting 

The standard FLD based methods apply first PCA for dimensionality reduction and then FLD for discriminant analysis [16], 
[1]. Relevant questions concerning PCA representation are usually related to the range of principal components used and how 
it affects performance. Regarding the FLD discriminant analysis one has to understand the reasons for overfitting and how to 
avoid it. For image representation, the more principal components are used, the better the quality of reconstruction. The same
reasoning does not apply, however, to class discrimination. One can actually show that using more principal components may
lead to decreased classification performance [9]. The explanation for such behavior is that the trailing eigenvalues (those added
as more principal components are used) correspond to high-frequency components and usually encode noise. As a result,
when the components with these small trailing eigenvalues are used to define the reduced PCA space, the FLD procedure has to fit
noise as well, and as a consequence overfitting takes place.

The FLD procedure involves the simultaneous diagonalization of the two within- and between-class scatter matrices and it 
is stepwise equivalent to two operations as pointed out by Fukunaga [6]: first whitening the within-class scatter matrix, and 
second applying PCA on the between-class scatter matrix using the transformed data. The purpose of the whitening step is to 
normalize (to unity) the within-class scatter matrix for uniform gain control. The second operation maximizes then the between- 
class scatter to separate different classes as much as possible. As during whitening the eigenvalues of the within-class scatter 
matrix appear in the denominator, the small (trailing) eigenvalues cause the whitening step to fit for misleading variations and 
thus generalize poorly when exposed to new data. The EFM, detailed in the following subsections, successfully address the 
overfitting problem and as a result would display increased generalization capabilities. 

3.3 EFM-1 

For EFM-1, one wants to choose m, the number of principal components in Eq. 3, such that a proper balance is maintained 
between the need that the selected eigenvalues account for most of the spectral energy of the raw data and the requirement that 
the eigenvalues of the within-class scatter matrix (in the reduced PCA space) are not too small. The eigenvalue spectrum of 
PCA is derived by Eq. 2, and the relative magnitude of the eigenvalues for the face data used in our experiments (see Sect. 2.5) 
is shown in Fig. 2. The index for the eigenvalues ranges from 1 to m = L, where L stands for the number of (face) classes 
considered in our experiments. One can see from Fig. 2 that the first 50 eigenvalues capture most of the energy and that the 
eigenvalues whose indices are greater than 100 are fairly small and most likely capture noise. 
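
The energy side of this trade-off can be made concrete with a small sketch. The helper names and the `target` threshold are hypothetical: the text selects the number of components by inspecting the eigenvalue plots rather than by a fixed energy fraction, and the sketch covers only the spectral-energy requirement, not the condition on the within-class eigenvalues.

```python
import numpy as np

def energy_fraction(eigvals):
    """Cumulative fraction of spectral energy captured by the leading eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return np.cumsum(lam) / lam.sum()

def smallest_m(eigvals, target):
    """Smallest number of components whose energy fraction reaches target."""
    frac = energy_fraction(eigvals)
    # searchsorted finds the first position where frac >= target.
    m = int(np.searchsorted(frac, target)) + 1
    return min(m, len(frac))
```

In an EFM-1 style selection, a value obtained this way would still have to be checked against the within-class eigenvalue spectrum before being adopted.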

To calculate the eigenvalue spectrum of the within-class scatter matrix in the reduced PCA space one needs to decompose
the FLD procedure into two operations: whitening and diagonalization [6]. In particular, let $\Sigma'_w, \Sigma'_b \in \mathbb{R}^{m \times m}$ represent the
within- and between-class scatter matrices in the reduced PCA space, respectively. The stepwise FLD procedure derives the
eigenvalues and eigenvectors of $(\Sigma'_w)^{-1} \Sigma'_b$ as the result of the simultaneous diagonalization of $\Sigma'_w$ and $\Sigma'_b$. First whiten the
within-class scatter matrix

$$\Gamma^{-1/2} \Xi^t \Sigma'_w \Xi \Gamma^{-1/2} = I \qquad (18)$$

where $\Gamma, \Xi \in \mathbb{R}^{m \times m}$ are the eigenvalue and eigenvector matrices of $\Sigma'_w$

$$\Sigma'_w \Xi = \Xi \Gamma \quad \text{and} \quad \Xi^t \Xi = I \qquad (19)$$

The eigenvalue spectrum of the within-class scatter matrix in the reduced PCA space can be derived by Eq. 19, and different
spectra are obtained for different numbers of principal components, as shown in Fig. 3. Now
one has to simultaneously optimize the behavior of the trailing eigenvalues in the reduced PCA space (Fig. 3) with the energy
criterion for the original image space (Fig. 2). Note that different choices of the cutoff principal component index for the original
image space yield different within-class spectra, as shown in Fig. 3. As one can see from Figs. 2 and 3, a suitable choice is to
set m = 50, since this choice not only accounts for most of the spectral energy of the raw data (Fig. 2) but also meets the
requirement that the eigenvalues of the within-class scatter matrix (in the reduced 50 dimensional PCA space) are not too small
(Fig. 3).

After the number of principal components is set, EFM-1 computes the new between-class scatter matrix $K_b$ as

$$K_b = \Gamma^{-1/2} \Xi^t \Sigma'_b \Xi \Gamma^{-1/2} \qquad (20)$$

Diagonalize the new between-class scatter matrix $K_b$

$$K_b \Theta = \Theta \Delta \quad \text{and} \quad \Theta^t \Theta = I \qquad (21)$$

The overall transformation matrix (in the reduced PCA space) for EFM-1 is now defined as

$$T = \Xi \Gamma^{-1/2} \Theta \qquad (22)$$
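
The whitening and diagonalization steps of Eqs. 18 to 22 can be sketched as below. The function name `efm1_transform` is hypothetical, and the scatter matrices of the reduced PCA space are assumed to be given, with a within-class scatter whose eigenvalues are all strictly positive.

```python
import numpy as np

def efm1_transform(Sw, Sb):
    """Overall transformation of Eq. 22 and the eigenvalues of K_b.

    Sw, Sb: within- and between-class scatter matrices in the reduced
    PCA space (both m x m, symmetric; Sw positive definite is assumed).
    """
    gamma, xi = np.linalg.eigh(Sw)           # Eq. 19: eigenvalues and eigenvectors of Sw
    # Broadcasting divides column j of xi by sqrt(gamma[j]),
    # which equals the product Xi Gamma^{-1/2}.
    whiten = xi / np.sqrt(gamma)
    Kb = whiten.T @ Sb @ whiten              # Eq. 20
    delta, theta = np.linalg.eigh(Kb)        # Eq. 21
    order = np.argsort(delta)[::-1]
    return whiten @ theta[:, order], delta[order]
```

The division by `np.sqrt(gamma)` is where small trailing eigenvalues do their damage: a very small within-class eigenvalue produces a very large scale factor, which is the overfitting mechanism that the choice of m in EFM-1 is meant to control.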

3.4 EFM-2 

The EFM-2 model uses the same number $m$ of principal components as utilized by Fisherfaces [1] and the MDF method [16]
for dimensionality reduction, namely, $L \le m \le n - L$, where $n$ is the number of training samples and $L$ the number of
classes. Note that in our experiments (see Sect. 2.5) the number of classes is $L = 369$ and the number of training images is
$n = 738$, therefore $m = 369$. The relative magnitude of the eigenvalues derived from the FLD whitening step in the $m = 369$
dimensional PCA space is shown in Fig. 4. EFM-2 then proceeds by choosing a small set of features (corresponding
to the eigenvectors of the within-class scatter matrix, see Eq. 19) after the whitening procedure so that the smaller trailing
eigenvalues are not included.
Let $s$ ($s < m$) be the number of features chosen in the whitening space according to the eigenvalue spectrum of $\Sigma'_w$ (see
Fig. 4; a suitable choice is to set $s = 100$), and a new projection matrix $\Xi^s \in \mathbb{R}^{m \times s}$ is derived

$$\Xi^s = [\xi_1, \xi_2, \ldots, \xi_s] \qquad (23)$$

where $\xi_1, \xi_2, \ldots, \xi_s$ are the eigenvectors of $\Sigma'_w$ corresponding to the leading eigenvalues $\gamma_1, \gamma_2, \ldots, \gamma_s$. The new diagonal
matrix is

$$\Gamma^s = \mathrm{diag}\{\gamma_1, \gamma_2, \ldots, \gamma_s\} \qquad (24)$$

The new whitening transformation matrix in the reduced $s$ dimensional whitened space is

$$Q = \Xi^s (\Gamma^s)^{-1/2} \qquad (25)$$

The between-class scatter matrix in this space is transformed to $K_b^s$

$$K_b^s = Q^t \Sigma'_b Q \qquad (26)$$

Note that the new between-class scatter matrix $K_b^s$ is now $s \times s$ instead of $m \times m$. Diagonalize $K_b^s$

$$K_b^s \Theta^s = \Theta^s \Delta^s \quad \text{and} \quad (\Theta^s)^t \Theta^s = I \qquad (27)$$

The overall transformation matrix (after PCA) for EFM-2 is

$$T^s = \Xi^s (\Gamma^s)^{-1/2} \Theta^s \qquad (28)$$
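
The truncated whitening of Eqs. 23 to 28 differs from the EFM-1 steps only in that the eigenvectors with the smallest within-class eigenvalues are dropped before whitening. A minimal sketch, with the hypothetical name `efm2_transform` and scatter matrices assumed to be given, is:

```python
import numpy as np

def efm2_transform(Sw, Sb, s):
    """Overall m x s transformation of Eq. 28 and the eigenvalues of the reduced K_b.

    Sw, Sb: within- and between-class scatter matrices in the reduced
    PCA space; s: number of leading within-class eigenvectors to keep.
    """
    gamma, xi = np.linalg.eigh(Sw)
    # Keep the s leading eigenvalues and eigenvectors (Eqs. 23 and 24).
    lead = np.argsort(gamma)[::-1][:s]
    Q = xi[:, lead] / np.sqrt(gamma[lead])   # Eq. 25
    Kb_s = Q.T @ Sb @ Q                      # Eq. 26, an s x s matrix
    delta, theta = np.linalg.eigh(Kb_s)      # Eq. 27
    order = np.argsort(delta)[::-1]
    return Q @ theta[:, order], delta[order]
```

Since only the s largest within-class eigenvalues enter the square root, the small trailing eigenvalues never appear in a denominator, and the diagonalization works on an s x s matrix instead of an m x m one.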

3.5 Experimental Results 

The experimental data used is as described in Sect. 2.5. Fig. 5 shows the comparative performance for Fisherfaces and our 
new EFM-1 and EFM-2 models. For the Fisherfaces, the number of principal components is optimally set to $L$, because $L \le m \le n - L$
and $n = 2L$ in our experiments (the number of classes $L = 369$, the number of training images $n = 738$). The EFM methods
increase the correct retrieval rate by 10% to 15% compared to Fisherfaces.

4 Conclusions 

We introduced in this paper two new coding schemes, the Probabilistic Reasoning Models (PRM) and the Enhanced FLD 
Models (EFM), for indexing and retrieval from large image databases with applications to face recognition. Experimental results 
using in excess of one thousand face images from the FERET facial image database show the feasibility and the robustness 
of our approach. Robustness is shown in terms of both absolute performance indices and comparative performance against 
traditional face recognition methods such as eigenfaces and Fisherfaces. Even though we have used the FERET digital face 
library to demonstrate the robustness of our encoding schemes, our general approach is not restricted to face images and can be 
applied to other digital image libraries as well. The methods provide robust coding schemes for image content-based indexing 
and retrieval based on image features derived in very low dimensional space. Because of the low dimensionality of the feature 
space, these methods are suitable for real time indexing and retrieval of large databases. 
Figure 1. Comparative testing performance for the eigenfaces, the Fisherfaces (using 200 principal
components), and the PRM methods (PRM-1, PRM-2).
Figure 3. The relative magnitude of the eigenvalue spectra derived using the FLD whitening step with
different numbers of principal components (m = 30, m = 40, and m = 50) for dimensionality reduction.
Figure 5. The comparative performance for the Fisherfaces, EFM-1, and EFM-2 models.